Add the SuperVectorizer and dirty_cat's encoders to the search space #169

Open

wants to merge 8 commits into base: typed_data_terminals
Conversation

@LilianBoulard commented Aug 25, 2022

This PR adds dirty_cat's encoders (currently SimilarityEncoder, GapEncoder, and MinHashEncoder) to GAMA's search space via the SuperVectorizer.

Adding the dirty_cat encoders enables GAMA to handle dirty categorical features in tabular data.

The SuperVectorizer provides a simplified interface to scikit-learn's ColumnTransformer and makes it possible to mix and match different encoding techniques.

This PR depends on features introduced in dirty_cat 0.3, which, at the time of writing (August 2022), has not been released yet.
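
As a rough illustration of the idea (not code from this PR), the sketch below shows how a SuperVectorizer could route high-cardinality categorical columns to one of the dirty_cat encoders inside a plain scikit-learn pipeline. The parameter names follow my reading of the dirty_cat 0.3 API, and the threshold and component values are arbitrary placeholders, so treat them as assumptions.

```python
# Minimal sketch, assuming the dirty_cat 0.3 API: high-cardinality string
# columns go through MinHashEncoder, everything else uses the
# SuperVectorizer's defaults. Values are placeholders, not this PR's settings.
from dirty_cat import SuperVectorizer, MinHashEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

vectorizer = SuperVectorizer(
    cardinality_threshold=40,  # placeholder: columns above this are treated as high-cardinality
    high_card_cat_transformer=MinHashEncoder(n_components=30),  # SimilarityEncoder() or GapEncoder() would also fit here
)

model = make_pipeline(vectorizer, LogisticRegression())
# model.fit(X, y)  # X: a pandas DataFrame with dirty categorical columns
```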

TODO:

  • wait for dirty_cat 0.3 to be out
  • fine-tune the preprocessing search space (a hypothetical sketch of what such an entry could look like follows this list)
  • benchmark GAMA to compare the performance before and after the introduction of the SuperVectorizer
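
To make the search-space item above more concrete, here is a purely hypothetical sketch of what such entries could look like, assuming a dict-style configuration that maps estimator classes to lists of candidate hyperparameter values. GAMA's actual configuration format (in particular on the typed_data_terminals branch) may differ, and the value ranges are placeholders.

```python
# Hypothetical search-space entries, not the configuration from this PR.
# Assumes a class -> {hyperparameter: candidate values} mapping.
from dirty_cat import SuperVectorizer, SimilarityEncoder, GapEncoder, MinHashEncoder

preprocessing_config = {
    SuperVectorizer: {
        "cardinality_threshold": [20, 40, 60],  # placeholder thresholds for "high cardinality"
        "high_card_cat_transformer": [
            SimilarityEncoder(),
            GapEncoder(n_components=30),
            MinHashEncoder(n_components=30),
        ],
    },
}
```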

@PGijsbers (Member)

Please give me a ping here as soon as dirty cat 0.3 is released :)

@LilianBoulard (Author)

Hi Pieter, dirty_cat 0.3 is out!

@PGijsbers (Member)

I have allowed CI to run; I'll try to take a closer look over this week and the next. I will probably do the 22.0.0 release without this PR (I was planning to do that today or tomorrow, since the current PyPI package is broken due to updated dependencies), so ignore the message about adding things to the changelog; I'll handle that later when preparing 22.1.0.

@PGijsbers (Member) commented Sep 14, 2022

Ah, it looks like the unit tests that use pre-defined individuals are now broken (as expected). I am not entirely sure how I want to fix that: it depends on whether we want to keep the old behavior available as an alternative, which in turn depends on a small benchmark. So I don't think there's much you can do right now to improve the tests or code.

Running some additional experiments to define a sensible default search space, as noted in the OP, should be possible and is appreciated :)
